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Summary 

Large contingency tables summarizing categorical variables arise in many areas. One example is in bi- 
ology, where large numbers of biomarkers are cross-tabulated according to their discrete expression level. 
Interactions of the variables are of great interest and are generally studied with log-linear models. The struc- 
ture of a log-linear model can be visually represented by a graph from which the conditional independence 
structure can then be easily read off. However, since the number of parameters in a saturated model grows 
exponentially in the number of variables, this generally comes with a heavy computational burden. Even 
if we restrict ourselves to models of lower order interactions or other sparse structures we are faced with 
the problem of a large number of cells which plays the role of sample size. This is in sharp contrast to 
high-dimensional regression or classification procedures because, in addition to a high-dimensional param- 
eter, we also have to deal with the analogue of a huge sample size. Furthermore, high-dimensional tables 
naturally feature a large number of sampling zeros which often leads to the nonexistence of the maximum 
likelihood estimate. We therefore present a decomposition approach, where we first divide the problem into 
several lower-dimensional problems and then combine these to form a global solution. Our methodology 
is computationally feasible for log-linear interaction models with many categorical variables each or some 
of them having many levels. We demonstrate the proposed method on simulated data and apply it to a 
bio-medical problem in cancer research. 



Key words: Categorical Data, Graphical Model, Group Lasso, Log-Linear Models, Sparse Contingency 
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1 Background 

We consider the problem of estimation and model selection in log-linear models for large contingency 
tables involving many categorical variables. This problem encompasses the estimation of the graphical 
model structure for categorical variables. This structure learning task has lately received considerable 
attention as it plays an important role in a broad range of applications. The conditional independence 
structure of the distribution can be read off directly from the structure of a graphical model (a graph) and 
hence provides a graphical representation of the distribution that is easy to interpret (see Lauritzen ( 1996| l). 



Graphical models for categorical variables correspond to a class of hierarchical log-linear interaction mod- 
els for contingency tables. Thus, fitting a graph corresponds to model selection in a hierarchical log-linear 
model log(p) = X/3, where p is the vector of cell probabilities of a table, /3 is a parameter vector and X 
is the design matrix (see section [Z2] i. 

Fitting the log-linear model for large contingency tables in full detail turns out to be a very hard com- 
putational problem in practice. In the following, we list three possible goals for fitting log-linear models. 
The goals are ranked in increasing order according to their computational difficulty: 
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Graphical Structure: Finding the graphical structure for discrete categorical variables is easiest but it 
doesn't allow to infer the magnitude of the coefficients f3 in the log-linear model, see also formula 
©. 

Parameter Vector /3: The next level of difficulty is the estimation of the unknown parameter vector 
/3 in a log-linear model whose full dimension equals the number of cells in the contingency table. 
For large tables, the dimension of /3 is huge but under some sparsity assumptions it is possible to 
accurately estimate such a high-dimensional vector using suitable regularisation. The major problem 
is here that besides the high-dimensionality of (3, the analogue of the sample size (the row-dimension 
of X) is huge, e.g. 3'^^ for 40 categorical variables having 3 levels each. 

Probability Vector p: The most difficult problem is the estimation of the probability vector p whose 
dimension equals again the number of cells in the table. It is rather unrealistic to place some sparsity 
assumptions on p in the sense that many entries would equal exactly zero which would enable feasible 
computation. Therefore, it is impossible to ever compute an estimate of the whole probability vector 
p (e.g. having dimensionality 3"'"). Nevertheless, thanks to sparsity of the parameter vector /3 and the 
junction tree algorithm, it is possible to compute accurate estimates {p{i); i £ C} for any reasonable- 
sized collection C of cells in the contingency table. 

There is hardly any method which can achieve all these goals for contingency tables involving many, 
say more than 20, variables. One approach to address the log-linear modeling problem for large contin- 
gency tables is presented in IJackson et al. ( 2007| l, where some dimensionality reduction is achieved by 



reducing the number of levels per variable. The reduction is accomplished via collapsing two categories 
by aggregating their counts if the two categories behave sufficiently similar If d variables are considered, 
this method reduces the problem at best to d binary variables. For this special case with binary factors, an 
approach based on many logistic regressions can be used for fitting log-linear interaction models whose 



computational complexity is feasible even if the number of variables is large (Wainwright et al. 20071. 
Another method to address the log-linear modeling problem for large contingency tables is proposed in 
[Kim, (2005 ), where the variables are grouped such that they are highly connected within groups but less 
between groups and graphical models are fitted for these subgroups. The subgraph models are then com- 
bined using so-called graphs of prime separators. The implementation of the combination however is not 
an easy task and no exact algorithm is given on how to combine the models. 

Our presented methodology allows to achieve all the goals above for categorical variables having possi- 
bly different numbers of levels: inference of a graphical model for discrete variables, of a sparse parameter 
vector in a log-linear model and of a collection of cell probabilities. Motivated by the approach in Kim] 
( 2005| l, we also propose a decomposition approach, where the dimensionality reduction is achieved by re 



cursively collapsing the large contingency table on certain variables (decomposition) and thereby reducing 
the problem to smaller tables which can be handled more easily. All the fitted lower-dimensional log-linear 
models are combined appropriately to represent an estimation of the joint distribution of all variables. The 
procedure enables us to handle very large tables e.g. up to hundreds of categorical variables, where some 
or all of them can have more than two categories. This multi -category framework is much more challeng- 
ing than the approach in [Wainwright et ar| ( |2007[ ) for large binary tables. In |Ravikumar et a/.] ( |2009| l, an 
extension to the multi-category case within the class of pairwise Markov random fields is sketched: it is 
argued that higher-order interactions could be included in a pairwise Markov field but no methodology or 
algorithm is described how to actually do this difficult computational task. 



2 Definitions 

In this section, we introduce several important theoretical concepts. First, we define the general log-linear 
interaction model. Subsequently, we introduce the hierarchical log-linear model and the graphical model as 
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restricted versions of the log-linear interaction model. Then, we define collapsibility and decomposabiUty, 
which are crucial concepts for breaking up a graphical model into smaller pieces. 

2.1 Log-linear interaction model 

We adopt here the notation of |Darroch et a/.| ( [l980| l. Assume we have some factors or categorical variables, 
indexed by a set V . Each factor v &V has a set of possible levels /i, = {0, 1, . . . , k^}. The contingency 
table is the cartesian product of the individual sets: / — ni,ev ^v- An individual cell in the contingency 
table is denoted by i e / and the corresponding cell count by rij. A marginal index for variable set a C y 
is denoted by ia- For example, assume we have two binary variables, then V — {1, 2}, Ii — I2 — {0, 1} 
and the contingency table is given by / = {(0, 0), (0, 1), (1,0), (1, 1)}. An individual cell is for example 
i = (0, 1) and the index describing the margin along the second variable (a = 2) is Zq = Z2 = (!)■ The 
total number of cells in a contingency table is m = |/| = IlDev l^fl- '^^^ example, m — 4. A natural 
way of representing the distribution of the cell counts is via a vector of probabilities p = « € /). 

In our example, this would correspond to defining four probabilities p — (P(o,o)jP(o,i)iP(i.o)>P(i.i))- If ^ 
total number of n individuals are classified independently, then the distribution of the corresponding cell 
counts n = (ni, 712, ... , rim) is multinomial with probability p. Finally, the general log-linear interaction 
model specifies the unknown distribution p as follows: 

logp(i) - ^ ea(«a) VZ e / (1) 
aCV 

where are functions of cell i which only depend on the variables in a. These functions are called 
interactions between the variables in a. If \a\ = 1, is called main effect, if |a| = 2 first order interaction 
and an interaction of order fc — 1 if |a| = k. For identifiability purposes we impose constraints on the 
functions, namely that A;*''-order interaction functions are orthogonal to interaction functions of lower 
order. 

2.2 Hierarchical log-linear models 

A hierarchical log-linear model is a log-linear interaction model with the additional constraint, that a 
vanishing interaction forces all interactions of higher order to be zero as well: 

Ca = 0=^^b = yaCbCV 

Hierarchical models can be specified via the so-called generators or generating class Q which is a set of 
subsets of V consisting of the maximal interactions which are present. More precisely, the generating class 
Q has the following property: 

= <S=^ there is no (7 e with a C q. (2) 

Consider an example with three binary factors. An example for a hierarchical log-linear model is the 
model consisting of all main effects, an interaction between 1 and 2 and an interaction between 1 and 3: 
this corresponds to Q = {{1, 2}, {1, 3}}. However, the log-linear interaction model with main effect 1 and 
interaction between 1 and 2 (but no main effect 2) is not in the class of hierarchical log-linear models. 
If we go back to formula ([T} and rewrite it in matrix formulation, we get: 

log(p)=X/3, (3) 

where /3 is a vector of unknown coefficients and X e M'" ^ ™ the design matrix. Each row of X corresponds 
to a certain cell and the columns of X correspond to the functions ^a(ia)- The number of columns needed 
to represent the function depends on the number of different states ia can take on. For example consider 
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a categorical variable a that can take on 3 levels. Then, is called a main effect (as \a\ = 1) and Xa 
(the columns of X corresponding to a) is 2-dimensional. Originally, it would be 3-dimensional but for 
identifiability purposes, the subspace spanned by Xa is chosen orthogonal to the already existing columns 
of lower order interaction (here orthogonal to the intercept) and we further choose it orthogonal within the 
subspace. By choosing the identifiability constraints this way, the parameterization of the matrix used in 
([3]) is equivalent to choosing a poly contrast in terms of ANOVA. If we go back to our example with two 
binary factors where m = 4, Q becomes: 



logp 



/(0,0)\ 
(0,1) 
(1,0) 

V(i'i)/ 



-X/3, 



(4) 



with X = 
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and f3 = (/30,/3i,/32 



where the first column is the intercept and belongs to a = 0, the second column belongs to a = 1 and has 
entry 1 whenever variable a = 1 takes on the first level and -1 else and similarly for the third column. The 
fourth column belongs to the interaction between variable 1 and 2. A description of X for the general case 



can be found in |Dahinden et al. ( 2007| l. In the following we will denote the components of /3 belonging to 

Xa with I3a- 



2.3 Graphical models 

We first introduce some terminology. A graph is defined as a pair G — {V, E), where V is the set of 
vertices or nodes and E C V x V is the set of edges linking the vertices. Each node represents a (cat- 
egorical) random variable. Here we only consider undirected graphs which means that {u, v) £ E is 
equivalent to {v, u) £ E. A path from u to w is a sequence of distinct nodes vq — u,vi, . . . ,Vn = 
V such that (w^, f^+i) € E for all i € {0, 1, . . . , n — 1}. Given three sets of variables A,S,B C V, we 
say that S separates A from B inV if all paths from vertices in A to vertices in B have to pass through S. 
Consider a random vector Z = {Zy,v E V} with a given distribution. We say that the distribution of Z is 
globally Markov with respect to a graph G if for any 3 disjoint subsets A, S, B C V the following property 
holds: 



S separates A from B =^ Za-LZb\Zs, (5) 

where the symbol "_L" denotes (conditional) independence. This states that we can read off conditional 
independence relations directly from the graph if the distribution is globally Markov with respect to the 
graph. Graphical models therefore provide a way to represent conditional (in)dependence relations between 
variables in terms of a graph structure. We say that a set of nodes of G forms a complete subgraph of G if 
every pair in that set is connected by an edge. A maximal complete subgraph is called a clique. 

The undirected graphical model (from now on "graphical model" for short) represented by a graph G 
corresponds to a hierarchical log-linear model where the cliques of the graph are the generators of the 
model. If we go back to our example in the previous section and assume that (3i2 ^ in formula Q, then 
the hierarchical log-linear model Q can be represented by the graphical model in Fig. [T|a); if /3i2 — 0, 
then the corresponding graphical model is the one in Fig. [ijb). 

Conversely, assume that the generators of a log-linear model are given by a set Q. By connecting all 
the vertices appearing in the same generator with each other and placing no other edge, the so-called 
interaction graph is built. By the definition of the interaction graph and by looking at formula ([T]i, it 
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Fig. 1 Examples of graphical models, (a) Full (maximal) model corresponding to the example given by formula (j4|. 
(b) In the same example graph with P12 — 0. (c) Example of an interaction graph corresponding to a hierarchical 
log-linear model which is not graphical, (d) The separator S = 2 has index 1^(3) — 2. (e) The separator S = 2 has 
index i'(S) — 1. 




3 




becomes clear that the distribution induced by the log-linear model is Markov with respect to the interaction 
graph and we can read off conditional independencies directly from the graph. It is also clear that Q 
corresponds to a graphical model via its interaction graph if and only if Q is the set of cliques of this graph. 
In that case we say that C/ is a graphical generating class. If there are cliques in the interaction graph 
which are not in Q, the hierarchical log-linear interaction model is not graphical and its interaction structure 
cannot be adequately represented by the graph alone. However, the graph may still completely represent 
all conditional independencies of the underlying distribution. The simplest example of a hierarchical log- 
linear model AI which is not graphical is y = {1, 2, 3} and Q = {{1, 2}, {2, 3}, {3, 1}}. Its interaction 
graph is shown in Fig. [TJc) which has as its only clique the complete graph {1,2,3}. Since the set of 
cliques and the set of generators Q are not identical, model M is not graphical. 

Any joint probability distribution of discrete random variables can be expanded in terms of a log-linear 
interaction model. For some distributions it is possible to represent all (conditional) independencies in an 
undirected graphical model and these distributions are called faithful to their interaction graph G or we 
say that the graph is a perfect map of the distribution. In other words, the graph captures all and only the 
conditional independence relations of the distribution. 



2.4 CollapsibUity 

Collapsing over a variable simply means summing over that variable and thereby reducing (collapsing) the 
table to the remaining dimensions. 

When collapsing over a variable, spurious associations between the remaining variables may be intro- 
duced and original associations can vanish. A criterion that addresses this problem is collapsibility . We say 
that a variable is collapsible with respect to a specific interaction when the interaction in the original 
contingency table is identical to the interaction in the collapsed contingency table (i.e., in equation Q 
remains the same). 



The general result regarding collapsibility, which goes back to a theorem stated in Bishop et al. ( 1975 1, can 
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be summarized as follows: 

By collapsing a table over a variable which interacts with s other variables, then s- and higher 
order interactions between the remaining variables are not changed in the collapsed table. (6) 
Conversely, lower-order interactions between the remaining variables are affected by collapsing. 

Suppose, for example, that variable X only interacts with one other variable Y. If we collapse over X, no 
interaction changes but main effects may be changed. Furthermore, suppose that Z is independent of all 
other variables. Then neither main effects nor interactions change when collapsing over Z. 

2.5 Decomposability 

Assume a graphical model on the vertex set V. A triple of disjoint subsets {A, S, B) of the vertex set V 
forms a decomposition if (i) = A U S* U i?, (ii) S separates A from B and (iii) S is complete. 

Decomposability is defined recursively: A graph is decomposable if it is complete or if there exists 
a decomposition (A, 5, B) where the subgraphs of G restricted to the vertex sets A U 5 and S U B are 
decomposable. 

Denote by C the set of all cliques of a decomposable graph and by S the set of all separators. For a 
decomposable graph with decomposition into cliques C and separators S, the probability of a cell i is given 
by the following formula (see for example Proposition 4.18 of Lauritzen ( 1996| l): 



where v{S) is the so-called index of the separator. The formal definition of the index v{S) is a bit cumber- 
some and is given in Lauritzen ( 1996| l. However, intuitively it can be thought of as the number of times the 



set S acts as a separator. For example: In Fig. [TJd), node 2 separates {1} and {4} (the cliques consisting of 
single nodes {1} and {4}), {1} and {3}, and also {3} and {4}. Therefore the index of the separating node 
2 is 3, as node 2 acts three times as separator. If we look at Fig. [TJe), we see that the node 2 only separates 
{1} from the clique {3, 4} as the single nodes 3 and 4 are no longer cliques since there is an edge between 
them. Therefore, the node 2 only acts once as separator and thus the corresponding index is 1 . 

It might not be possible to decompose the graph into decomposable components. By definition, this 
is the case for non-decomposable graphs. The simplest example of a non-decomposable graph is given in 
Fig. [TJf)- For a non-decomposable graph one can always add a minimal number of edges to the graph, such 
that it becomes decomposable (this step is called minimal triangulation; see |01esen and Madsen (2002|). 
Formula (|7]i also holds for such triangulated graphs. 



3 Estimating a Log-Linear Model Using a Decomposition Approach 

In this section, we propose our novel method for recovering both the graph structure and the parameters 
of a discrete graphical model. The underlying idea is to learn models over small sets of variables and 
stitch them together. More precisely, we first estimate an initial graph using node-wise regressions, add 
triangulations if necessary to make this graph decomposable and then use the decomposed components 
as the smaller sets of variables. The smaller sets are then analyzed one at a time and stitched together 
using formula (|7]i. In the following, we will describe each step of our method in more detail. The details 
of log-linear model selection on the smaller sets of variables are deferred to section]?] An outline of our 
decomposition approach for log-linear model estimation is given in the display Algorithm]!] 

3.1 Estimation of initial graph by node-wise regression 

If we knew the underlying true graph and if it was sparse, we could use a decomposition and collapse the 
contingency table on sub-tables given by the cliques C and the separators S from the decomposition. Then 
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input : Node set V — {Zi, . . . , Zp}, data matrix DonV,C = {},S= {}, M — {}, s^ax 
output: Estimated log-linear model 

// Estimate importance matrix using node-wise regression (see 
section |3 . l[ ) 

1 Set M = R= R;=p*p matrix consisting of O's 

2 for i in l:p do 

3 Do regression Zi ^ V \ {Zi} using c forest on D 

4 M j _i := importance measure of regression for each covariate 

5 Ri,i:p := ranks of i:p (small number corresponds to small rank) 

6 end 

7 Rij :=max(Rij,Rj,i) 

// Triangulation and Recursive Decomposition (see section |3 . 2| ) 

8 G := complete graph on V 

9 while |F| > Odo 

10 if G is not decomposable then 

11 G := minimal triangulation of G 

12 else 

13 G := G 

14 end 

15 find any minimal clique G of G 

16 if |G| < s,nax then 

17 split G into minimal separator S and rest A: S = C 

18 record G in C, in 5 

19 estimate log-linear model on G (see section ' 



4.1 



and save in M. 



20 V :=V \A,G := subgraph of G on V, R:=Rig\/,jgy 

21 else 

22 {i, i) := index of minimal entry in R 

23 G := G where edge between i and j is deleted 

24 -E^i.j — '■— 

25 end 

26 end 

// Combination of results (see section |3 . 3| ) 

27 logp(i) = Ecec logP(«c) - EsG5 '^('5') logp(«s) (p{is) and p{ic) were recorded in M) 
Algorithm 1: Outline of our decomposition approach for log-linear model estimation. 



we could perform model selection in the collapsed tables and combine the estimates according to formula 
(|7|. Of course, we do not know the graph and therefore we don't know C and S for the decomposition. In 
this section we propose a method of how to come up with an initial graph estimate. 

A log-linear model measures the associations among the variables. The association between two vari- 
ables can also be measured by doing regression from one variable upon the others. It is thus reasonable to 
apply a regression method to find groups of variables which are highly associated within a group but only 
weakly associated between groups. 

See for the following also Algorithm [T] lines 1 to 7. Assume a graphical model on the node set V = 
{1, . . . ,p} with corresponding random variables Z = {Zi, . . . , Zp}. For every node i in the graph, we run 
a regression with Zi as response variable and all remaining variables V \ Zi as covariates. We then draw 
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an edge between nodes i and j if and only if the covariate Zj has an influence on the response Zi. Ideally, 
the regression method involves interactions among the covariates (unless we assume a binary Ising model 
as in |Wainwright et a/.| ( |2007l l; |Ravikumar et a/.| ( |2009) ). 

Inspired by Kim ( 2005| l, we use a non-parametric regression approach. But instead of their single 
regression tree strategy we use a Random Forest approach (see |Breiman| ( |2001| l). Trees can naturally 
incorporate interactions between variables without running severely into the curse of dimensionality and 
are therefore ideal for our purposes. We prefer Random Forest instead of a single tree, since Random 
Forest oftentimes yield much more stable results for variable selection. 

There are three common ways of measuring the importance of individual variables in Random Forest 
regression. First, the importance measure can be the number of times a variable has been chosen as split 
variable (selection frequency). Second, the decrease in the so-called Gini index can be used. Third, the 
permutation accuracy which measures the prediction accuracy before and after permuting a variable can 
be used. 

By performing node-wise regression from each variable on all others and using any importance measure 
mentioned above, an importance matrix M can be built whose element Mi,j describes the importance of 
variable Zj (acting as covariate) to the variable Zi (acting as response). Note that this matrix is not 
symmetric. 

There is one technical difficulty, which we would like to mention here. A high entry in the importance 
matrix indicates a strong association between the corresponding row and column variable. However, de- 
pending on the importance measure, the values between various predictor variables as well as between 
different regressions might not be comparable. It has been shown that popular importance criteria in Ran- 
dom Forest such as the Gini index, the selection frequency or the permutation accuracy are all strongly 
biased towards variables with more categories (see Strobl et al. (2007i). We therefore use the "cforest" 
method proposed by Strobl et al. ( 2007 1 which provides a variable importance measure that can be re- 
liably used for variable selection even in situations where the predictor variables vary in their scale of 
measurement or their number of categories. 

Still, however, the importance measures are only consistent within rows but not between rows as the 
variable importance not only depends on the predictor variables in a regression but also on the response 
variable. Therefore, values cannot be directly compared between rows. For that reason we only consider 
the ranks of the importance matrix entries within rows. This yields the importance matrix of ranks R with 
R-i,i:p — i'ank(Mj 1, . . . , p), where small numbers correspond to small ranks. Thus, two variables Zi 
and Zj are strongly (conditionally) dependent if and only if R, j and R j ^ are both large. In this context, 
we define a symmetrized importance matrix of ranks R with R^ j = max(Ri j,Rj i) Vz,j G V. An 
initial graph estimate could now be obtained by placing only those edges, whose corresponding entry in R 
is larger than a given cutoff. 



3.2 Triangulation and recursive decomposition 

Since it is not clear how to find a suitable cutoff for the initial graph, we employ a recursive scheme (see 
Algorithm [T] lines 8 to 26). First, we start with a complete graph Gq. Then, we delete the edge with the 
smallest importance according to R. If the resulting graph Gi is not decomposable, we extend it to its 



minimal triangulation Gi, which is guaranteed to be decomposable (see section 2.5 1. If Gi has a clique 
that is small enough to be analyzed on its own (this depends on computational resources and could be the 
case for s„iax = 10 binary variables), we split it off. If it has no such clique, we keep deleting edges in 
Gi according to smallest importance (and set the corresponding entry of deleted edges in R to infinity, 
so that it is not considered for deletion again) until there is a small enough clique. Let's assume that 
AU S corresponds to a clique in the (triangulated) thinned graph, S separates A from V \ {A U 5*} and 
|j4 U Sj < Sjnax = 10. We then spUt off this clique (and we then estimate a log-linear model on it, see 
sectionK.lb. 
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From here, we restart the whole procedure with the remaining graph, i.e. setting Go := \ A. As 
coiTesponding importance matrix, we take only those rows and columns of R which correspond to the 
nodes inV \ A. This is repeated until the remaining graph consists of cliques whose cardinalities are less 
or equal to Smax- Therefore, the amount of edges which we recursively delete depends on the maximal size 
of cliques which we allow, denoted by s^ax which is a tuning parameter of our procedure of initial graph 
estimation. As mentioned above, Smax is usually chosen by computational requirements as the optimal 
size of the initial graph seems to be of minor importance. 

Note that it is not crucial that we have the sparsest possible subgraphs to collapse on, as long as any 
reasonable log-linear model selection procedure can be applied for these subgraphs. Such a decomposition 
is implemented in the R package LLdecomp, available at http : / / www . r-pro ject . org. 



3.3 Combination of results 

Assume we have collapsed the table on the cliques C and separators S of the graph induced by the recursive 
thinning and decomposition procedure. Furthermore, assume that we have fitted a model for each of these 
sub-tables (the collapsed tables on C and S). We then get the log-linear model corresponding to the full 
graph by using formula (|7]): 

logp(i) = logp(ic) - ^(-^^ logP(«s) 

cec ses 

= ^ X,/3c - K5)Xs/3s, (8) 
cec ses 

where Xq and Xg are the design matrices resulting from restricting the total design matrix to nodes in C 
and S respectively. The same notation applies to f3c and f3s- Formula ([8]) describes how to aggregate the 
results of the collapsed tables. In addition, one can derive from ([8]) that if we have 3 disjoint subsets A, S, B 
where S separates A from B, then we can safely collapse over B without changing an interaction between 
variables in A or between variables consisting of a mix of A and S. The only interactions which might 
change are the ones between variables which are exclusively in S (in the following denoted by separator 
interactions). This is in accordance with the result stated in (|6]l. But as formula ([8]) holds, the introduced 
interactions have a very small (3 coefficient. We therefore expect that if we threshold the parameter vector 
/3, most of the introduced zeros belong to so-called separator interactions ^: 3S G S with ^ e 5, i.e. 
interactions exclusively contained in a separator. Consequently, we set the threshold that the introduced 
zeros belong to equal parts to separator- and non-separator interactions. See Figure [2] for a graphical 



illustration of the procedure. In section p3] we will argue empirically that such a thresholding rule works 
well. 



4 Graphical Model Selection Procedures 



In this section, we state five methods for log-linear model selection. Three of them (DGL, DSF and DF) 
can be used for model selection within our proposed decomposition approach (see line 19 in Algorithm[T]). 
The remaining two (WW and RF) are established alternative methods and will be used for comparison. 

Our methods use up to three tuning parameters: For decomposition of the graph, we use a bound on the 
size of cliques in our graph (Smax; see line 16 of algorithm [T}. Furthermore, in section 3.3 we introduce 
a threshold for coiTecting the interaction-adding effect of collapsing to marginal models. Finally, in this 
section we use regularization parameters for selecting the graph within components. 



4.1 Model selection for decomposition approach 



The first method we propose is inspired by the Lasso, originally formulated by Tibshirani ( 1996 1 for es- 
timation and variable selection in linear regression. Extending this idea, a model selection approach for 



Copyright line will be provided by the publisher 



10 C. Dahinden, M. Kalisch, and P. Biihlmann: Decomposition and Model Selection for Large Contingency Tables 



Fig. 2 Illustration of how many separator edges to take up into the model, x-axis: the fraction of separator interactions 
^ with f3^ — among all separator interactions, y-axis: the fraction of non-separator interactions with estimated 
interaction coefficient equal zero. The points correspond to different levels of thresholding. We see that if we threshold 
30% of the non-zero /3 coefficients, we have almost exclusively thresholded separator interactions, as we would expect. 



Fraction o1 separator interaotions estimaterj zero 



log-linear models has been developed i n [Dahinden et al. (2007 1. The coefficient vector (3 is estimated with 
the group-lasso-penalty ( [Yuan and Lin[ [2006 1: 



/3 = arg min 



-i/(/3) 



aCC 



(9) 



where l{(3) = E"i"»(X/3), = logp;3[n] + c and \\f3Jl = Ejif^a)]- Therefore l{f3) is up to an 
additive constant c, which does not depend on /3, the log-likelihood function. This minimization has to 
be calculated under the additional constraint that the cell probabilities add to 1. The group-lasso-penalty 
has the property that the solution of (|9]l is independent of the choice of the orthogonal subspace of Xa and 
furthermore, the penalty encourages sparsity at the interaction level. Thus, the vector 0^ corresponding 



to the interaction (see section 2.1 1 has all components either non-zero or zero. Furthermore, by using 
group-lasso-penalty model selection we avoid the sampling zero problem, which is problematic regarding 



the existence of the MLE (see e.g. Christensen ( 19911). The tuning parameter A can be assessed by cross- 
validation: we divide the individual counts into a number of equal parts and in turn leave out one part and 
use the rest to form a training contingency table with cell counts ritrain- 

We abbreviate our decomposition approach using group-lasso-penalty for arbitrary values of A by DGL 
("Decomposition Group Lasso"). When A is chosen by cross-validation, we abbreviate the method by 
DGL:CV. If, in addition to that, hard-thresholding for the parameter vector /3 is used in the specific way 



described in section 3.3 we abbreviate this method by DGL:F ("Final model of suggested procedure"). 



Our second method is a stepwise forward procedure and aims to minimize the AlC-type criterion sk — 
21og(Z), where I is the maximized value of the likelihood function for the corresponding model with 
k degrees of freedom; 5 = 2 corresponds to the genuine AIC. Here we also vary the parameter s; a 
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large parameter leads to sparser models. We abbreviate this method for arbitrary values of s by DSF 
("Decomposition Stepwise Forward") and by DSF:AIC if s = 2. 

Finally, we propose a third method where no model selection is performed after decomposition and the 
MLE on the decomposed model is used. This corresponds to DSF with s = 0. We abbreviate this method 
by DF ("Decomposition Full model"). 



4.2 Alternative approaches for comparison 

Our first method for comparison is given in [Wainwright et al. { 2001) , where the problem of estimating 



the graph structure of binary valued Markov networks is considered. They propose to estimate the neigh- 
borhood of any given node by performing -penalized logistic regressions on the remaining variables 
using some penalty parameter A. Assume we have d binary random variables and observations thereof 
z = (zi, . . . , Zd) G {0, 1}''. Furthermore, we assume that the data are generated under the so-called Ising 
model: 

d 

logp(z) = ^stzszt + ^-(6), (10) 

s,t=l 

where B is a symmetric dx d matrix and ^'(B) is a normalizing constant which ensures that the probabil- 



ities add up to one. If we go back to the log-linear interaction model described in section 2.1 with binary 
variables, i.e. the cell i £ {0, 1}'', then by comparing formula ([l| to ( lOi we see that the Ising model 



is a log-linear model whose highest interactions are of order one and the parameterization is, in terms of 
ANOVA, with Helmert instead of poly contrasts. Therefore, the interaction graph builds up by connecting 
the nodes s and t for which (3st ^ 0. Wainwright et al. (|2007j) prove that under certain sparsity assumptions 



their method correctly identifies the underlying graph structure. Note that both our method and Wainwright 



et al. ( 20071) use node-wise regression. The difference is, that we estimate the node neighborhoods by per- 



forming normal regression and hard-thresholding importance measures derived from regression weights. 
We emphasize that the [Wainwright et a l. { 2007) approach works for binary variables only, while our de- 
composition approach explained in section [3] works for general multi -category variables. For arbitrary 
values of the tuning parameter A we abbreviate this method by WW. If the tuning parameter is chosen by 
cross-validation, we abbreviate the method by WW:CV. It turns out, that WW:CV sometimes yields dense 
models. Therefore, we abbreviate by WW:MIN the solution for the minimal A for which the normalisation 
constant could be computed (due to computational limitations connected with the junction tree algorithm). 

Our second method for comparison is Random Forest. As explained in section |3.1| Random Forest 
together with a suitable cutoff on the rank importance matrix R can be used to derive an initial graph esti- 
mate. This yields a graph but no estimation of the log-linear interaction model is provided. We abbreviate 
this method by RF. 

For convenience, in Table [T] we give a short overview of the methods that were defined in this section 
and are going to be used in the simulation study in the next section. 



5 Simulation Study 

We simulate from a log-linear interaction model corresponding to a graph with 40 nodes and 91 edges. 



Each node corresponds to a binary variable (and thus, we can compare with the method in Wainwright 



et al. ( 2007| l, explained in section 4.2 1. The graph is displayed in Fig. [3] This is the same simulation setting 



as was used in |Kim| ( [2005| l. We generate 10 datasets each consisting of 100000 observations according to 



the graph in Fig 
In section 15.1 



we investigate the effect of the maximal clique size 



(for splitting off a clique as 



explained in section 3.2 1 in our decomposition approach with respect to performance in estimating the 
correct model structure. In section 5.2 we compare different approaches with respect to performance in 
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Table 1 Overview of the methods used in the simulation studies. 



Abbreviation 


Description 


DGL 
DGL:CV 
DGL:F 
DSF 

DSF:AIC 

DF 

WW 

WW:CV 

WW:MIN 

RF 


Decomposition approach using group-lasso-penalty without fixing the penalty parameter. 

As DGL but penalty parameter fixed by cross-validation. 

Final result of DGL and hard-thresholding as explained in section 3.3 

Decomposition approach using stepwise forward selection without hxing penalty parameter. 

As DSF but penalty parameter fixed corresponding to AIC. 

Decomposition approach without model selection but using MLE. 


Approach by Wainwright et al. (2007 1 without fixing the penalty parameter. 


As WW but penalty parameter hxed by cross-validation. 

As WW using minimal penalty parameter that is computationally feasible for junction tree. 
Random Forest. 



Fig. 3 Graph from which we simulate. Nodes correspond to binary random variables. 




estimating the correct model structure, while in section 5.3 we compare in terms of accuracy for estimating 
the cell probabilities. 
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5.1 Optimal clique size 

We estimate a model using DGL:CV for s^ax equal to 3, 5 and 10. A ROC curve is shown in Fig. |4j 
where the endpoints of the curves correspond to the selected model of DGL:CV. The curves then build 
up by successively eliminating edges corresponding to the smallest estimated interaction vector coefficient 
0. We see here that larger decomposition sizes lead to sUghtly more favorable ROC curves. The picture 
remains qualitatively the same if we use DSF instead of DGL. For the remainder of the simulations, we 
will keep the maximal clique size in our approach fixed at s„iax = 10. 

Fig. 4 Comparison of decomposition sizes. Decomposition into cliques of maximal size 3, 5 and 10 using DGL:CV. 
The curves corresponds to models which arise by thresholding the final /3-coefficient. 




False Positive Rate 



5.2 Performance for structure estimation 

First, we compare DGL, DSF and DF. The performance for structure estimation is shown in ROC curves 
(see Fig. |5]). For DSF the curve builds up by varying s (compare section 4.1 1. Note, that the starting 



point of the curve of DSF coincides with DF. DGL starts from DGL:CV and uses hard-thresholding of (3 
for obtaining the values on the ROC curve. Note that during this hard-thresholding, the value of DGL:F 
is produced, too. We see that DSF and DGL lead to models which have approximately the same number 
of false positive and false negative edges, but the DGL is slightly favorable. The solution of DF has the 
largest false positive rate (as was expected since no model selection was done). 

Second, in Fig. |6]the alternative approaches WW and RF are compared with DGL. In order to keep a 
simple overview, the results for DSF are no longer shown. We see that our decomposition approach (DGL) 
slightly outperforms the alternative approaches WW and RF. Keep in mind that in addition to the advantage 
of our method in performance, RF does not yield an estimate for the parameter vector and WW does so 
only for binary variables, whereas our method is applicable to general multi-category variables. 
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Fig. 5 Comparison of decomposition approaches DGL, DSF and DF using Smax = 10. The dotted vertical line 
corresponds to the difference in the true positive rate (number of correctly selected edges / number of true edges) for 
the two procedures when they are compared at the false positive rate (number of incorrectly selected edges / number 
of true gaps) of DGL:F. Note that the x-axis is shown only up to 0.5. 
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Table 2 Comparison of true positive rates of different methods at the false positive rate of DGL:F (indicated in Fig. 
[5] and Fig. [6|by vertical lines). 





Mean difference 


p-value (t-test) 


tpi-(DGL) - tpr(DSF) 


0.052 


0.036 


tpi-(DGL) - tpr(WW) 


0.060 


0.011 


tpr(DGL) - tpr(RF) 


0.013 


0.256 



Even though Fig. |5]and Fig. |6]only represent one simulated dataset, we observed a similar picture for 
other simulation settings. The lines for DGL and DSF are always very close, with DGL being slightly 
better and both methods being clearly superior to the global approaches. The reason why Fig. |5]and Fig. 
[6] only display results from one dataset is that the single final models cannot be averaged over different 
datasets as they have different positions on the curves for different datasets (different numbers of true and 
false positives). If we average over all these values, the result is not very meaningful anymore. However, 
we can average the differences of true positive rates for e.g. DGL and DSF where both methods have the 
same number of false positives as the solution of DGL:F (dotted vertical line in Fig. |5]l. The results of 
such comparisons are summarized in Table [2] We see that DGL yields a significantly (for significance 
level OL — 0.05) higher true positive rate than the corresponding solutions of DSF and WW. On the other 
hand, the comparison between the solution of DGL and RF (at the false positive rate of DGL:F) shows no 
significant difference. 
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Fig. 6 Comparison of DGL with WW and RF. For DGL the line builds up by thresholding the /3-coefficient where 
A is chosen by cross-validation. For RF, the edges with least importance are successively eliminated and for WW, the 
tuning parameter A is varied. The grey vertical line indicates the false positive rate of DGL:F at which the true positive 
rates of the other methods are compared in Table |2] 




False Positive Rale 



5.3 Performance for estimation of cell probabilities 

DGL, DSF, DF and WW yield estimates of the parameter vector (3 which can be immediately transformed 
into estimates of cell probabilities. In this section we will compare these methods with respect to the 
performance for estimating the cell probabilities. Recall that RF does not yield an estimate of the parameter 
vector and is therefore not included in this comparison. 

All approaches considered in this section yield the parameter vector (3 only up to a constant. We need 
to ensure that the estimated cell probabilities add up to one. For sparse graphs, we can use the junction 
tree algorithm to calculate the normalising constant for the probabilities. For a detailed description see 
|Lauritzen| ( [T996l l. 

We compare the estimated probabilities, using cross-validation for tuning parameters and using an ex- 
pression which is up to a constant the Kullback-Leibler divergence between the estimated and the true 
probability (non-normalized Kullback-Leibler divergence): 

-^og[\{ffl^=-Y,p.,\ogp,, (11) 

where p is the estimated probability vector and p denotes the true probability vector. As this sum requires 
the calculation of 2''" w lO^'^ components of p and p and the summation of the two huge vectors, this is 
computationally not feasible. To avoid this problem, we calculate an empirical version by simulating one 
million observations from the graph in Fig. [sjand summing over these 10^ values only. 

The results are summarized in Table [3] We see that DGL:F, DSFiAIC and DF perform similarly and 
WW:MIN is clearly inferior For WW:CV, which is indicated in Fig. |6] the normalizing constant could not 
be computed. This is because the CV-optimal solution almost corresponds to the full model and thus, the 
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Fig. 7 Mean empirical (non-normalized) KuUback-Leibler distance between true and estimated probability in de- 
pendence of the percentage of thresholded coefficients. The vertical line indicates the average of percentages of 
thresholded coefficients using the thresholding rule from section [33] 



Kullback-Leibler divergence 

Kullback-Leibler divergence for tliresiiold-optimal solution 
Mean Percentage cf thresholded coefficients 




% of the thresholded R coefficients. 



junction tree algorithm is not feasible. The maximal computable solutions of WW still correspond to very 
large models, which on average involve 22.05% of all possible edges, compared to 17.01% for DGL:F and 
11.66% for the true graph. 

Table |4]provides further insight about significance of the differences in Table [3] All methods are com- 
pared against each other using a paired t-test for the empirical (non-normalized) Kullback-Leibler di- 
vergences and the p-values are reported. One can see that there is no significant (at significance level 
a — 0.05) difference for the decomposition approaches, whereas they are all superior to WW. This pro- 
vides evidence that the decomposition of the model is more crucial than the effective choice of the log- 
linear model fitting procedure afterwards. 

Furthermore, it is worthwhile stating that the thresholding of the coefficients (see section 3.3 1 does 
hardly influence the likelihood as can be seen in Fig. [t] On average, 35% of the coefficients are thresholded 
as indicated by the dotted line. However, for these threshold-optimal solutions, calculated as described in 
section 3.3 the empirical (non-normalized) Kullback-Leibler divergence is approximately the same as for 



the non-thresholded model. 



6 Application to Tissue Microarray Data 

6.1 Tissue Microarray technology 

The central motivation that led to this work was to fit a graphical model to discrete expression levels of 
biomarkers resulting from Tissue Microarray (TMA) experiments. Tissue MicroaiTay technology allows 
rapid visualization of molecular targets in thousands of tissues at a time, either at DNA, RNA or protein 
level. Tissue Microarrays are composed of hundreds of tissue sections from different patients arrayed on a 
single glass slide. With the use of immunohistochemical staining, they provide a high-throughput method 
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Table 3 Mean empirical (non-normalized) Kullback-Leibler divergence between true and estimated probabilities. 
The p- values for all the pairwise comparisons are given in Table |4] 





Mean 


SD 


DGL:F 


4.0080 


0.0204 


DSF:AIC 


4.0242 


0.0217 


DF 


4.0223 


0.0099 


WW:MIN 


4.3360 


0.1094 



Table 4 All possible pairwise comparisons between models: p-values of a paired t-test for the equality of (non- 
normalized) Kullback-Leibler divergence. 





DSF:AIC 


DF 


WW:MIN 


DGL:F 


0.1030 


0.0677 


4.7 IQ-*^ 


DSF:AIC 




0.8080 


6.0 IQ-*^ 


DF 






7.5 10"*^ 



to analyze potential biomarkers on large patient samples. The assessment of the expression level of a 
biomarker is usually performed by the pathologist on a categorical scale: expressed/not expressed, or the 
level of expression. 

Tissue Microarrays are powerful for validation and extension of findings obtained from genomic surveys 
such as cDNA microarrays. cDNA microarrays are useful to analyze a huge number of genes, e.g. a couple 
of thousands in one specimen at a time. In contrast, TMAs are applicable to the analysis of one target at a 
time, denoted as biomarker, but in up to 1000 tissues on each slide. 

The analysis of the interaction pattern of these biomarkers and in particular the estimation of the graph- 
ical model associated with the underlying discrete random variables are of bio-medical importance. These 
graph-based patterns can deliver valuable insight into the underlying biology. A detailed description of the 
Tissue Microarray technology can be found in KalUoniemi et al. (2001 1. 



6.2 Estimation of grapliical model 

Our TMA dataset consists of Tissue Microarray measurements from renal cell carcinoma patients. We 
have information from 1116 patients, 831 thereof having a clear cell carcinoma tumor, which is the tumor 
of interest here. We have identified 18 biomarkers from which we have information for the majority 
of the patients. Among 831 ccRCC (clear cell renal cell carcinoma) observations, 527 observations are 
complete with all biomarker measurements available. For 87 observations one measurement was missing, 
64 and 30 observations had 2 or 3 missing values, respectively. 123 observations contained more than 
3 missing values and were ignored in the analysis. For the observations with 1, 2 or 3 missing values, 
multiple imputation was applied using the R package mice ( |Van Buuren and Oudshoornl[2007j . From 18 
biomarkers, 9 are binary and 9 have 3 levels. 

Using DGL:F, we estimated a graphical model to the TMA data. The resulting graph is displayed in 
Fig. [8] The thickness of the line corresponds to the ^2-norm of the respective interaction coefficients. Two 
biomarkers connected by a thick line, as is the case for nuclear p27 and cytoplasmic p27, indicates a strong 
interaction. The kinase inhibitor p27 exhibits its function in the cell nucleus and therefore recent studies 
have focused on nuclear p27 expression. Our graphical log-linear model however shows a tight association 
between nuclear and cytoplasmic expression of p27. Therefore it can be speculated that both nuclear and 
cytoplasmic presence is required to ensure proper function of p27. It has been shown that in renal tumors. 
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the von Hippel-Lindau protein (VHL protein) is upregulating the expression of the tumor suppressor p27 
HOsipov efoT] [2002^. 



The graphical model here provides supporting evidence that VHL indeed regulates 
p27, and the corresponding (3 coefficient (not displayed here) implies that it is a positive regulation. 

Furthermore it has been shown in vitro by Roe et al. (2006 1 that VHL increases p53 expression which 
is a tumor suppressor. In our model it seems as if p53 is conditionally independent of VHL. Indeed, it has 
long been known that p53 activates expression of p21 (e.g. Kim ( 1997| ). This dependence is displayed 



very clearly in the graphical model. We can therefore view the p53-p21 pathway with its strong interaction 
as one unit and it is therefore very reasonable that nuclear VHL interacts with p53. As nuclear VHL is only 
expressed in 14% of the tumors, and it further makes sense from a biological point of view that the strong 
interaction between VHL and the p21-p53 pathway is in fact a causal relation, we can indeed speculate 
that the loss of VHL deactivates the tumor suppressor p53 which in turn favors tumor development. 



et al. 



CA9, Glutl and Cyclin Dl are all hypoxia-inducible transcription factor (HIF) target genes (Wenger 
2005| l. HIF has not been measured but we can clearly see that all these HIF targets are connected 



by a rather thick line implying that they might react to a common gene. In addition, CD 10 strongly 
interacts with Glutl a known HIF target which suggests that CDIO might also be regulated by HIF. The 



reduction of E-Cadherin expression has been found to be negatively correlated with HIF expression in Imai 



et al. (2003 1. This is supported by a strong negative interaction between E-Cadherin and CA9 which is 



positively correlated with HIF expression (not measured). 

A lot of supporting evidence has been delivered for already existing theories. However, two strong 
interactions, one between PAX2 and nuclear p21 and the other between PAX2 and Cyclin Dl cannot be 
immediately explained. PAX2 is absent in normal renal tubular epithelial cells but expressed in many 
clear cell renal cell carcinoma tumours (see Mazal et al. ( 2005) ). Its frequent expression together with the 
strong interaction with the p21-p53 pathway, Cychn Dl and PTEN make PAX2 an interesting and possibly 
important molecular parameter whose exact function and role still remains to be elucidated. 

A more detailed discussion of bio-medical implications is given in Dahinden et al. (2009 1. 



7 Discussion 

We have proposed a decomposition approach to estimating log-linear models for large contingency tables 
and for fitting discrete graphical models. In a simulation study we have compared various algorithms and 
concluded that our procedures are very competitive. It seems that the decomposition of the problem is 
much more crucial than the choice of the algorithm to handle the smaller decomposed datasets: no matter 
whether DGL, DSF or DF is applied, the resulting models are superior to non-decomposition approaches 
such as WW and RF for model selection as well as for probability or parameter estimation. 

Maybe most important is the computational feasibility of our procedure for large contingency tables. 
The proposed method is scalable to orders of realistic complexity (e.g. dozens up to hundreds of factors) 
where most or all other existing algorithms become infeasible. In particular, our procedure is not only 
capable of handling binary data but can easily deal with factors with more levels. Furthermore, with DGL 
one doesn't risk the nonexistence of the parameter estimator in case of sampling zeroes in the contingency 
table as this might arise when using the MLE. Our procedure not only fits a graphical model but also yields 
an estimation of the parameter vector /3 in a log-linear model and therefore of the cell probabilities. All this 
is achieved with good performance in comparison to other methods. As a drawback, if the true underlying 
graph has a clique which is larger than our decomposition size, then some of the edges in the graph are 
necessarily lost. 

We apply the proposed approach to a problem in molecular biology and we find supporting evidence for 
dependencies between biomarkers which have already been found to exist in vitro or some even in renal 
tumors, the domain of our application. However, some strong interactions cannot be explained immediately 
and therefore, new biological hypotheses arise. 
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Fig. 8 Estimated graphical model from Tissue Microarray data 




The R package LLdecomp for computing a decomposition as described in this paper (i.e., recursively 
finding cliques and separators) is available at http: / /www. r-pro ject . org. 
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